Overview

Dataset Statistics

Number of Variables 15
Number of Rows 3632836
Missing Cells 0
Missing Cells (%) 0.0%
Duplicate Rows 712077
Duplicate Rows (%) 19.6%
Total Size in Memory 1.1 GB
Average Row Size in Memory 322.1 B
Variable Types
  • Categorical: 3
  • Numerical: 12
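
The overview figures can be cross-checked with a little arithmetic. The sketch below is only an illustration and not part of the profiling tool: the constant names are made up here, and reading "1.1 GB" as binary gigabytes (GiB) is an assumption.

```python
# Figures copied from the Dataset Statistics block.
# The exact row count matches the same_product character count
# (one character per row).
N_ROWS = 3_632_836
DUPLICATE_ROWS = 712_077
AVG_ROW_BYTES = 322.1

# Share of rows that are exact duplicates of another row.
dup_pct = 100 * DUPLICATE_ROWS / N_ROWS

# Total size implied by the average row size.
# Assumption: the report's "GB" means 2**30 bytes.
total_gib = N_ROWS * AVG_ROW_BYTES / 2**30

print(f"duplicate rows: {dup_pct:.1f}%")
print(f"estimated size: {total_gib:.2f} GiB")
```

Both results should land close to the reported 19.6% and 1.1 GB; a large gap would suggest the overview and the per-variable blocks were computed on different data.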

Dataset Insights

  • Skewed: jaro_distance, jaro_winkler_distance, overlap_coefficient_distance, generalized_jaccard_distance, tfidf_distance, soft_tfidf_distance
  • Duplicates: the dataset has 712077 (19.6%) duplicate rows
  • High Cardinality: ProductID and ProductID2 each have 1709 distinct values
  • Constant Length: same_product has constant length 1
  • Zeros: jaro_winkler_distance has 1179320 (32.46%) zeros

Variables


ProductID

categorical

Approximate Distinct Count 1709
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 425177138

Length

Mean 52.0373
Standard Deviation 20.4398
Median 70
Minimum 11
Maximum 120

Sample

1st row amd ryzen 5 1600 b...
2nd row amd ryzen 5 1600 b...
3rd row amd ryzen 5 1600 b...
4th row amd ryzen 5 1600 b...
5th row amd ryzen 5 1600 b...

Letter

Count 105138772
Lowercase Letter 105138772
Space Separator 29853678
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 51330486
  • ProductID contains many words: 1140 words

ProductID2

categorical

Approximate Distinct Count 1709
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 425177138

Length

Mean 52.0373
Standard Deviation 20.4398
Median 50
Minimum 11
Maximum 120

Sample

1st row amd ryzen 5 1600 b...
2nd row amd ryzen 5 1600
3rd row amd ryzen 5 1600 b...
4th row amd ryzen 5 1600 y...
5th row amd ryzen 5 1600 b...

Letter

Count 105138772
Lowercase Letter 105138772
Space Separator 29853678
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 51330486
  • ProductID2 contains many words: 1140 words

same_product

categorical

Approximate Distinct Count 2
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 239767176
  • The most frequent value (0) occurs over 113.38 times as often as the second most frequent value (1)

Length

Mean 1
Standard Deviation 0
Median 1
Minimum 1
Maximum 1

Sample

1st row 1
2nd row 1
3rd row 1
4th row 1
5th row 1

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 3632836
  • The top 2 categories (0, 1) take over 50.0%
  • same_product has words of constant length
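
The 113.38 ratio between the two same_product values translates directly into a class share. The following is a rough sketch under two assumptions: the ratio is treated as exact although the report says "over", and the helper name `minority_share` is hypothetical.

```python
def minority_share(ratio):
    """Share of the rarer class when the commoner class is `ratio` times as frequent."""
    return 1.0 / (1.0 + ratio)

RATIO = 113.38        # reported frequency ratio of value 0 to value 1
N_ROWS = 3_632_836    # same_product character count (one character per row)

share = minority_share(RATIO)
approx_positive_rows = round(N_ROWS * share)

print(f"share of same_product == 1: {100 * share:.2f}%")
print(f"approximate number of such rows: {approx_positive_rows}")
```

With under 1% of rows in the rarer class, plain accuracy would be a poor yardstick for any classifier trained on this column.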

levenshtain_distance

numerical

Approximate Distinct Count 3008
Approximate Unique (%) 0.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 58125376
Mean 0.7456
Minimum 0
Maximum 1
Zeros 2428
Zeros (%) 0.1%
Negatives 0
Negatives (%) 0.0%
  • levenshtain_distance is skewed left (γ1 = -1.7739)

Quantile Statistics

Minimum 0
5-th Percentile 0.5312
Q1 0.6977
Median 0.7746
Q3 0.8256
95-th Percentile 0.8795
Maximum 1
Range 1
IQR 0.1279

Descriptive Statistics

Mean 0.7456
Standard Deviation 0.1179
Variance 0.01391
Sum 2.7088e+06
Skewness -1.7739
Kurtosis 5.064
Coefficient of Variation 0.1582
  • levenshtain_distance is not normally distributed (p-value 4.650831815050968e-05)
  • levenshtain_distance has 158132 outliers
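
The quantile and descriptive statistics in this block are tied together by simple formulas. The sketch below recomputes them from the rounded figures; the 1.5 × IQR (Tukey) fence is an assumption, since the report does not say which rule it uses to count outliers.

```python
# Rounded figures copied from the levenshtain_distance block.
Q1, Q3 = 0.6977, 0.8256
MEAN, STD = 0.7456, 0.1179
P5, MAXIMUM = 0.5312, 1.0

iqr = Q3 - Q1            # interquartile range
cv = STD / MEAN          # coefficient of variation
variance = STD ** 2

# Assumed outlier rule: values beyond 1.5 * IQR from the quartiles.
lower_fence = Q1 - 1.5 * iqr
upper_fence = Q3 + 1.5 * iqr

print(f"IQR={iqr:.4f}  CV={cv:.4f}  variance={variance:.5f}")
print(f"fences: [{lower_fence:.4f}, {upper_fence:.4f}]")
```

Under that assumed rule the upper fence lies above the maximum of 1, so every flagged value would sit in the lower tail, below the 5-th percentile. Small differences in the last digit against the reported values come from working with rounded inputs.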

needleman_wunsch_distance

numerical

Approximate Distinct Count 8341
Approximate Unique (%) 0.2%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 58125376
Mean 0.9199
Minimum 0
Maximum 1.3971
Zeros 2428
Zeros (%) 0.1%
Negatives 0
Negatives (%) 0.0%
  • needleman_wunsch_distance is skewed left (γ1 = -0.7324)

Quantile Statistics

Minimum 0
5-th Percentile 0.622
Q1 0.8302
Median 0.9273
Q3 1.0326
95-th Percentile 1.2021
Maximum 1.3971
Range 1.3971
IQR 0.2024

Descriptive Statistics

Mean 0.9199
Standard Deviation 0.1779
Variance 0.03164
Sum 3.3417e+06
Skewness -0.7324
Kurtosis 2.0668
Coefficient of Variation 0.1934
  • needleman_wunsch_distance is not normally distributed (p-value 0.005099709782856804)
  • needleman_wunsch_distance has 102844 outliers

affine_gap_distance

numerical

Approximate Distinct Count 238318
Approximate Unique (%) 6.6%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 58125376
Mean 0.8223
Minimum 0
Maximum 1.1382
Zeros 2428
Zeros (%) 0.1%
Negatives 0
Negatives (%) 0.0%
  • affine_gap_distance is skewed left (γ1 = -1.2684)

Quantile Statistics

Minimum 0
5-th Percentile 0.5695
Q1 0.7568
Median 0.8455
Q3 0.9154
95-th Percentile 1.0181
Maximum 1.1382
Range 1.1382
IQR 0.1586

Descriptive Statistics

Mean 0.8223
Standard Deviation 0.1431
Variance 0.02049
Sum 2.9874e+06
Skewness -1.2684
Kurtosis 3.2385
Coefficient of Variation 0.1741
  • affine_gap_distance is not normally distributed (p-value 0.002156743064323117)
  • affine_gap_distance has 126408 outliers

smith_waterman_distance

numerical

Approximate Distinct Count 2896
Approximate Unique (%) 0.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 58125376
Mean 0.8129
Minimum 0
Maximum 0.98
Zeros 2428
Zeros (%) 0.1%
Negatives 0
Negatives (%) 0.0%
  • smith_waterman_distance is skewed left (γ1 = -2.2734)

Quantile Statistics

Minimum 0
5-th Percentile 0.6087
Q1 0.7755
Median 0.8427
Q3 0.8864
95-th Percentile 0.9315
Maximum 0.98
Range 0.98
IQR 0.1109

Descriptive Statistics

Mean 0.8129
Standard Deviation 0.1135
Variance 0.01289
Sum 2.9532e+06
Skewness -2.2734
Kurtosis 8.1729
Coefficient of Variation 0.1397
  • smith_waterman_distance is not normally distributed (p-value 3.870378694800243e-06)
  • smith_waterman_distance has 199158 outliers

jaro_distance

numerical

Approximate Distinct Count 327256
Approximate Unique (%) 9.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 58125376
Mean 0.9899
Minimum 0.9091
Maximum 0.9969
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • jaro_distance is skewed left (γ1 = -2.9449)

Quantile Statistics

Minimum 0.9091
5-th Percentile 0.9826
Q1 0.9886
Median 0.991
Q3 0.9924
95-th Percentile 0.994
Maximum 0.9969
Range 0.08776
IQR 0.003787

Descriptive Statistics

Mean 0.9899
Standard Deviation 0.004094
Variance 1.6762e-05
Sum 3.596e+06
Skewness -2.9449
Kurtosis 18.7039
Coefficient of Variation 0.004136
  • jaro_distance is not normally distributed (p-value 1.5410793962080633e-14)
  • jaro_distance has 206042 outliers
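
For reference, skewness (γ1) and kurtosis can be computed from central moments. The negative kurtosis values elsewhere in this report imply that it lists excess kurtosis, because raw kurtosis cannot fall below 1. The sketch below uses population moments; whether the report applies a sample-size correction is not stated, and the toy data is invented purely for illustration.

```python
def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# Invented scores: mostly high with one low value, so the tail points left.
toy_scores = [0.10, 0.80, 0.85, 0.90, 0.95]
skew, kurt = skewness_and_excess_kurtosis(toy_scores)
print(f"skewness={skew:.3f}, excess kurtosis={kurt:.3f}")
```

A negative skewness, as reported for every numerical variable here, means the bulk of the values sits near the top of the range with a long tail toward 0.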

jaro_winkler_distance

numerical

Approximate Distinct Count 170247
Approximate Unique (%) 4.7%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 58125376
Mean 0.3004
Minimum 0
Maximum 0.6941
Zeros 1179320
Zeros (%) 32.5%
Negatives 0
Negatives (%) 0.0%
  • jaro_winkler_distance is skewed left (γ1 = -0.5715)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0
Median 0.4044
Q3 0.4684
95-th Percentile 0.5359
Maximum 0.6941
Range 0.6941
IQR 0.4684

Descriptive Statistics

Mean 0.3004
Standard Deviation 0.2145
Variance 0.046
Sum 1.0913e+06
Skewness -0.5715
Kurtosis -1.4456
Coefficient of Variation 0.714
  • jaro_winkler_distance is not normally distributed (p-value 5.031660985922888e-21)
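
The large share of zeros explains several figures in this block. The sketch below works through them, again assuming a 1.5 × IQR outlier rule that the report does not confirm.

```python
# Figures copied from the jaro_winkler_distance block.
ZEROS = 1_179_320
N_ROWS = 3_632_836      # same_product character count (one character per row)
Q1, Q3 = 0.0, 0.4684
MINIMUM, MAXIMUM = 0.0, 0.6941

zero_share = ZEROS / N_ROWS

# More than a quarter of the values are 0 and none are negative,
# so the first quartile is 0 and the IQR equals Q3.
iqr = Q3 - Q1

# Assumed outlier rule: values beyond 1.5 * IQR from the quartiles.
lower_fence = Q1 - 1.5 * iqr
upper_fence = Q3 + 1.5 * iqr
outliers_possible = MINIMUM < lower_fence or MAXIMUM > upper_fence

print(f"zeros: {100 * zero_share:.2f}%")
print(f"fences: [{lower_fence:.4f}, {upper_fence:.4f}]")
print(f"outliers possible: {outliers_possible}")
```

Under that rule no value can fall outside the fences, which would be consistent with this being the only numerical variable without an outlier count.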

overlap_coefficient_distance

numerical

Approximate Distinct Count 121
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 58125376
Mean 0.8032
Minimum 0
Maximum 1
Zeros 9598
Zeros (%) 0.3%
Negatives 0
Negatives (%) 0.0%
  • overlap_coefficient_distance is skewed left (γ1 = -0.9517)

Quantile Statistics

Minimum 0
5-th Percentile 0.5
Q1 0.7
Median 0.8333
Q3 1
95-th Percentile 1
Maximum 1
Range 1
IQR 0.3

Descriptive Statistics

Mean 0.8032
Standard Deviation 0.1833
Variance 0.03358
Sum 2.918e+06
Skewness -0.9517
Kurtosis 0.9156
Coefficient of Variation 0.2281
  • overlap_coefficient_distance is not normally distributed (p-value 2.034868339853162e-19)
  • overlap_coefficient_distance has 27728 outliers

generalized_jaccard_distance

numerical

Approximate Distinct Count 252
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 58125376
Mean 0.9109
Minimum 0
Maximum 1
Zeros 2528
Zeros (%) 0.1%
Negatives 0
Negatives (%) 0.0%
  • generalized_jaccard_distance is skewed left (γ1 = -2.2154)

Quantile Statistics

Minimum 0
5-th Percentile 0.75
Q1 0.8696
Median 0.9333
Q3 1
95-th Percentile 1
Maximum 1
Range 1
IQR 0.1304

Descriptive Statistics

Mean 0.9109
Standard Deviation 0.09603
Variance 0.009222
Sum 3.3091e+06
Skewness -2.2154
Kurtosis 10.2656
Coefficient of Variation 0.1054
  • generalized_jaccard_distance is not normally distributed (p-value 7.804555728301963e-17)
  • generalized_jaccard_distance has 99732 outliers

tfidf_distance

numerical

Approximate Distinct Count 2930
Approximate Unique (%) 0.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 58125376
Mean 0.9464
Minimum 0
Maximum 1
Zeros 2430
Zeros (%) 0.1%
Negatives 0
Negatives (%) 0.0%
  • tfidf_distance is skewed left (γ1 = -4.0795)

Quantile Statistics

Minimum 0
5-th Percentile 0.8385
Q1 0.9235
Median 0.9629
Q3 1
95-th Percentile 1
Maximum 1
Range 1
IQR 0.07647

Descriptive Statistics

Mean 0.9464
Standard Deviation 0.06703
Variance 0.004492
Sum 3.4382e+06
Skewness -4.0795
Kurtosis 37.023
Coefficient of Variation 0.07082
  • tfidf_distance is not normally distributed (p-value 1.484110393819857e-15)
  • tfidf_distance has 120456 outliers

soft_tfidf_distance

numerical

Approximate Distinct Count 1915827
Approximate Unique (%) 52.7%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 58125376
Mean 0.9927
Minimum 0.9091
Maximum 1
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • soft_tfidf_distance is skewed left (γ1 = -2.6665)

Quantile Statistics

Minimum 0.9091
5-th Percentile 0.9867
Q1 0.9914
Median 0.9933
Q3 0.9949
95-th Percentile 0.9969
Maximum 1
Range 0.09091
IQR 0.003497

Descriptive Statistics

Mean 0.9927
Standard Deviation 0.003567
Variance 1.2722e-05
Sum 3.6062e+06
Skewness -2.6665
Kurtosis 20.0914
Coefficient of Variation 0.003593
  • soft_tfidf_distance is not normally distributed (p-value 6.357814619577697e-14)
  • soft_tfidf_distance has 166466 outliers

partial_ration_distance

numerical

Approximate Distinct Count 96
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 58125376
Mean 0.582
Minimum 0
Maximum 0.95
Zeros 5824
Zeros (%) 0.2%
Negatives 0
Negatives (%) 0.0%
  • partial_ration_distance is skewed left (γ1 = -1.0744)

Quantile Statistics

Minimum 0
5-th Percentile 0.32
Q1 0.52
Median 0.61
Q3 0.68
95-th Percentile 0.76
Maximum 0.95
Range 0.95
IQR 0.16

Descriptive Statistics

Mean 0.582
Standard Deviation 0.1374
Variance 0.01888
Sum 2.1141e+06
Skewness -1.0744
Kurtosis 1.5815
Coefficient of Variation 0.2361
  • partial_ration_distance has 151532 outliers

bag_distance_distance

numerical

Approximate Distinct Count 3201
Approximate Unique (%) 0.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 58125376
Mean 0.5367
Minimum 0
Maximum 0.9254
Zeros 2434
Zeros (%) 0.1%
Negatives 0
Negatives (%) 0.0%
  • bag_distance_distance is skewed left (γ1 = -0.0591)

Quantile Statistics

Minimum 0
5-th Percentile 0.3108
Q1 0.4382
Median 0.5362
Q3 0.64
95-th Percentile 0.7778
Maximum 0.9254
Range 0.9254
IQR 0.2018

Descriptive Statistics

Mean 0.5367
Standard Deviation 0.1428
Variance 0.02038
Sum 1.9496e+06
Skewness -0.05909
Kurtosis -0.2144
Coefficient of Variation 0.266
  • bag_distance_distance is not normally distributed (p-value 0.00033589719872870183)
  • bag_distance_distance has 13022 outliers

Interactions

Correlations

Missing Values